Segmentation of Handwritten Document Images into Text Lines
Abstract
Many governmental, cultural, commercial and educational organizations manage large volumes of manuscript text. Since managing information recorded on paper or in scanned documents is a hard and time-consuming task, Document Image Analysis (DIA) aims to extract the intended information as a human would (Nagy, 2000). The main subtasks of DIA (Mao et al., 2003) are: i) document layout analysis, which aims to locate the "physical" components of the document such as columns, paragraphs, text lines, words, tables and figures; ii) document content analysis, which labels these components as titles, legends, footnotes, etc.; iii) optical character recognition (OCR); and iv) reconstruction of the corresponding electronic document. The algorithms proposed for these processing stages come mainly from the fields of image processing, computer vision, machine learning and pattern recognition. Some of these algorithms are very effective on machine-printed document images and have therefore been incorporated into the workflows of well-known OCR systems. In contrast, no comparably effective systems have been developed for handwritten documents. The main reason is that the format of a handwritten manuscript and the writing style depend solely on the author's choices. For example, text lines in a machine-printed document can be assumed to share the same skew, whereas handwritten text lines may be curvilinear.
Text line segmentation is a critical stage in layout analysis, on which further tasks build, such as word segmentation, grouping of text lines into paragraphs, and characterization of text lines as titles, headings, footnotes, etc. For instance, text-line segmentation is part of the pipeline of the Handwritten Address Interpretation System (HWAIS), which takes a postal address image and determines a unique delivery point (Cohen et al., 1994). Another application in which text line extraction serves as a preprocessing step is the indexing of the George Washington papers at the Library of Congress (Manmatha & Rothfeder, 2005). A similar document analysis project, the Bovary Project, includes a text-line segmentation stage on the way to transcribing the manuscripts of Gustave Flaubert (Nicolas et al., 2004a). In addition, many recent archive digitisation projects include document image understanding activities, namely the automatic or semi-automatic extraction and indexing of metadata such as titles, subtitles, keywords, etc. (Antonacopoulos & Karatzas, 2004; Tomai et al., 2002). These activities necessarily include text-line extraction.
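The contrast drawn above between uniformly skewed printed lines and curvilinear handwritten lines can be illustrated with a horizontal projection profile, a classic baseline for line segmentation. The sketch below is not a method taken from this chapter. The function name `segment_lines`, the `min_ink` threshold and the two toy images are assumptions made purely for illustration. It treats each run of rows containing ink as one text line, which works when lines are straight and separated by blank rows and fails once lines are skewed enough to overlap vertically.

```python
from typing import List, Tuple

def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Split a binary image (1 = ink, 0 = background) into text lines.

    Returns (top, bottom) row ranges, with bottom exclusive.
    Hypothetical helper: a plain projection-profile baseline.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # first inked row of a new line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # blank row closes the current line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines

# Two straight lines separated by blank rows (printed-like layout).
STRAIGHT = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

# Two lines that both drift downwards to the right (handwriting-like skew).
# Their row ranges overlap, so no blank row separates them.
SKEWED = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
]

print(segment_lines(STRAIGHT))  # two separate lines are found
print(segment_lines(SKEWED))    # both lines collapse into a single range
```

On the straight image the profile has a blank row between the two lines, so they are separated. On the skewed image every row contains ink and the two lines are merged, which is why a global profile of this kind is not sufficient for handwritten pages.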
Similar resources
A Survey on Word Segmentation Method for Handwritten Documents
One of the most important and challenging tasks in a handwritten recognition pipeline is the segmentation of handwritten document images into text lines and words. Several problems inherent in handwritten documents such as the difference in the skew angle between text lines or along the same text line, the existence of adjacent text lines or words touching, the existence of characters with diff...
Performance of Statistics Based Line Segmentation System for Unconstrained Handwritten Text
Handwritten character recognition is a technique by which a computer system could recognize characters and other symbols written in natural handwriting. Segmentation decomposes the document image into subcomponents like lines, words and characters. To achieve greater accuracy, segmentation and recognition could not be treated independently. Most of the existing line segmentation methods have li...
South Indian Tamil Language Handwritten Document Text Line Segmentation Technique with Aid of Sliding Window and Skewing Operations
In document image analysis, Text line segmentation is one of the key components. The segmentation logic presents essential information about skew correction, zone segmentation, and character recognition. The method of document image segmentation into text lines for printed text has seen numerous contributions from fellow research scholars, yet there is scope for tremendous improvement. The key ...
Segmentation of Handwritten and Printed Arabic Documents
In this paper, we propose a new text line segmentation method for handwritten and typewritten Arabic document images that uses the Outer Isothetic Cover (OIC) algorithm of a digital object. In the first step, we use this method to segment the composed document into text blocks. In the second step, we extract the text lines of each text block. Finally, each text line is segmented into words or into...
Word Extraction and Character Segmentation from Text Lines of Unconstrained Handwritten Bangla Document Images
In this paper, a novel approach for word extraction and character segmentation from the handwritten Bangla document images is reported. At first, a modified Run Length Smoothing Algorithm (RLSA), called Spiral Run Length Smearing Algorithm (SRLSA), is applied for the extraction of words from the text lines of unconstrained handwritten Bangla document images. This technique has helped to overcom...
Morphology Based Handwritten Line Segmentation Using Foreground and Background Information
Currently text line segmentation is an important stage of research in historical document processing. Because of inter-line distance variability and base-line skew variability, line segmentation in unconstrained handwritten document is very difficult. The line segmentation task gets complicated, when overlapping or inter-penetration situation occurs between two consecutive text lines. In this p...
Publication date: 2012